These proceedings outline steps toward a systematic analysis of frontier energy collider data: specifically, those 
data collected at Tevatron Runs I and II, LEP Run II, HERA Runs I and II, and the future LHC. Algorithms 
designed to understand the gross features of the data (Vista), to systematically and model-independently search 
for new physics at the electroweak scale (Sleuth), to automate tests of specific hypotheses against those data 
(QuAERO), to turn an existing full detector simulation into a fast simulation (TurboSim), and to infer the 
physics underlying any hint observed in the data (Bard) are reviewed. A somewhat non-conventional viewpoint 
is adopted throughout. 
1. Motivation 



While the standard model stands as a clear and 
successful description of nearly all experimental 
results to date, its consistent extension to ener- 
gies above the electroweak scale is a puzzle. A 
variety of new phenomena have been predicted at 
this scale, including (but certainly not limited to) 
magnetic monopoles, extra spatial dimensions, 
compositeness, new heavy gauge bosons, lepto- 
quarks, technicolor, supersymmetry, additional 
fermion generations, excited quarks and leptons, 
and a non-commutative space-time. 



From the high energy experimentalist's point 
of view, the range of possibilities is much wider 
yet, with each broad class of theories harboring a 
host of parameters whose values determine spe- 
cific phenomenological consequences. The min- 
imal supersymmetric extension to the standard 
model involves the introduction of a mere 105 free 
parameters. Performing a search in the data by 
scanning this parameter space is computationally 
intractable, so ad hoc assumptions are typically 
made to reduce the number of free parameters to 
two. 

Rather than starting from the somewhat di- 
rectionless guidance of theory, the keen experi- 
mentalist begins by examining the frontier energy 
data in their entirety, starting with an algorithm 
called Vista. 

2. Vista 

Vista, borrowed from Spanish and Italian, 
means "an extensive mental view," and involves 
the following steps. 

1. Define basic physics objects. Object criteria
are applied to identify electrons (e±),
muons (μ±), taus (τ±), photons (γ), jets
(j), jets from a parent bottom quark (b),
and missing energy (p̸T).

2. Filter all high-pT events. At Tevatron
Run II, these are events containing an isolated
and energetic electron, muon, or tau
with pT > 25 GeV, a photon with pT >
50 GeV, a b-jet or missing energy with
pT > 75 GeV, or a jet with pT > 100 GeV.

3. Estimate all backgrounds. MadEvent [1]
is turned into a virtual collider, and
the standard model contributions from all
processes (with intelligent prescaling) are
generated simultaneously, with systematic
computation of millions of Feynman diagrams.

4. Simulate detector response. The time cost 
of generating a modestly complicated event 
at a frontier energy experiment is roughly 
100 seconds, taking the geometric mean of 
the experiments on the LEP, HERA, Teva- 
tron, and LHC rings. The construction of 
a fast simulation matching the accuracy of 
the existing simulations is desired but dif- 
ficult; a novel algorithm (TurboSim) is a 
potential solution. 

5. Introduce experimental fudge factors. Of- 
ten euphemistically referred to as scale fac- 
tors or correction factors, quantities like 
integrated luminosity, trigger efficiencies, 
and misidentification probabilities are de- 
termined by a global fit between the stan- 
dard model prediction and observed data. 
A simple version of Vista's misidentification
matrix is illustrated in Table 1. Rows
represent true objects produced in the hard 
scattering; columns represent reconstructed 
objects observed in the detector. Each el- 
ement of the matrix gives the probabil- 
ity that the object corresponding to that 
row would be reconstructed as an object 
corresponding to that column; the diago- 
nal represents efficiencies, and off diagonal 
elements represent fake rates. Variation 
with energy and location in the detector is 
achieved by adding depth to the table, cor- 
responding to bins in energy and pseudora- 
pidity. 

6. Introduce theoretical fudge factors. So-called
"k-factors," representing the difference
between the higher order calculation
that cannot be performed and the leading
order calculation that can, are fit simultaneously
with the experimental fudge factors.





      e       μ       τ       γ       j       b
e     0.91            0.02    1e-3    0.07    1e-3
μ             0.87
τ                     0.10            0.90
γ                             0.81    0.19
j     1e-4    2e-6    3e-3    6e-4    1       2e-3
b     1e-3    1e-3    5e-3    8e-4    0.60    0.40



Table 1 



A cartoon illustration of Vista's misidentifica- 
tion matrix, incorporating some of the experi- 
mental fudge factors that are systematically fit 
through a global comparison of data to standard 
model prediction. Each element of the matrix 
represents the probability that the object label- 
ing that row will be (mis)identified in the detector 
as the type of object labeling that column. 
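
As a rough illustration of how such a matrix could be applied, the sketch below draws a reconstructed object type for each true object using the cartoon numbers of Table 1. Several choices here are assumptions made only for the example and are not taken from Vista: blank cells are treated as zero, rows whose cartoon entries sum to slightly more than one are rescaled, any probability left over in a row means the object is not reconstructed at all, and the function names are hypothetical. The binning in energy and pseudorapidity described in the text is omitted.

```python
import random

# Table 1, with blank cells taken to be zero (an assumption of this sketch).
# Outer key: true object; inner key: reconstructed object.
MISID = {
    "e":     {"e": 0.91, "tau": 0.02, "gamma": 1e-3, "j": 0.07, "b": 1e-3},
    "mu":    {"mu": 0.87},
    "tau":   {"tau": 0.10, "j": 0.90},
    "gamma": {"gamma": 0.81, "j": 0.19},
    "j":     {"e": 1e-4, "mu": 2e-6, "tau": 3e-3, "gamma": 6e-4, "j": 1.0, "b": 2e-3},
    "b":     {"e": 1e-3, "mu": 1e-3, "tau": 5e-3, "gamma": 8e-4, "j": 0.60, "b": 0.40},
}


def row_probabilities(true_object):
    """Return one row of the matrix, rescaled if it sums to more than one."""
    row = MISID[true_object]
    total = sum(row.values())
    scale = 1.0 / total if total > 1.0 else 1.0
    return {reco: p * scale for reco, p in row.items()}


def reconstruct(true_object, rng):
    """Draw a reconstructed type; None means nothing is reconstructed."""
    u = rng.random()
    cumulative = 0.0
    for reco, prob in row_probabilities(true_object).items():
        cumulative += prob
        if u < cumulative:
            return reco
    return None


def final_state(true_objects, rng):
    """Label an event by the sorted tuple of its reconstructed objects."""
    reco = (reconstruct(obj, rng) for obj in true_objects)
    return tuple(sorted(r for r in reco if r is not None))


if __name__ == "__main__":
    rng = random.Random(1)
    print(final_state(["e", "mu", "b", "b"], rng))
```

In a global fit, the entries of `MISID` would be free parameters adjusted until the predicted populations of all final states agree with the data.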



After these steps, the standard model predic- 
tion is compared globally to the observed data. 
Events are partitioned into exclusive final states 
characterized by the types of objects they con- 
tain. In each exclusive final state, the number of 
events observed in the data is compared to the 
number of events predicted from standard model 
processes, and the shapes of all relevant kine- 
matic distributions are compared using a simple 
Kolmogorov-Smirnov (KS) test. The scientific re- 
sult of Vista is a catalog of all gross discrepancies 
between the high energy data and the standard 
model prediction. No such catalog currently ex- 
ists. 
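
The shape comparison mentioned above can be sketched with a hand-written two-sample Kolmogorov-Smirnov test. This is a minimal, unweighted version: a real comparison would need to handle weighted Monte Carlo events, and the asymptotic formula used for the probability is only approximate for small samples. The function names are hypothetical.

```python
import bisect
import math


def ks_statistic(sample_a, sample_b):
    """Largest distance between the two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_probability(d, n_a, n_b, terms=100):
    """Asymptotic probability of a distance at least d if both samples
    come from the same parent distribution (Kolmogorov series)."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        # The series converges slowly here and the probability is ~1.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


if __name__ == "__main__":
    data = [31.0, 44.5, 52.1, 60.3, 75.8]
    prediction = [28.4, 35.0, 47.7, 55.2, 58.9, 66.1, 80.0]
    d = ks_statistic(data, prediction)
    print(d, ks_probability(d, len(data), len(prediction)))
```

Running this for every kinematic distribution in every exclusive final state, and sorting by probability, gives one way to list the largest shape discrepancies first.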

Side benefits of this approach include a com- 
plete estimation of all standard model back- 
grounds; a validation of the detector simulation, 
best achieved by directly comparing to data; a 
validation of the data, best achieved by directly 
comparing to standard model prediction; and 
a systematic determination of experimental and 
theoretical fudge factors. Simultaneously fitting 
for fudge factors also produces a complete correlated
error matrix, and hence a consistent global
assignment of systematic errors. 

If the gross features of the data indicate some 
discrepancy that does not lend itself to interpre- 
tation in terms of experimental inadequacy, the 
result is published. If all gross features of the 
data are well described by the standard model 
prediction, attention is turned to those regions of 
the data that prejudice suggests are most likely to 
indicate the presence of new physics at the elec- 
troweak scale. Expecting small statistics signals, 
care must be taken to rigorously and without 
bias quantify the interestingness of any observed 
effect. The algorithm for doing this at frontier
energy hadron colliders is Sleuth, a
quasi-model-independent search strategy for new
high-pT physics.

3. Sleuth 

Sleuth is based on the following three well- 
justified assumptions. 

• The data can be partitioned in such a way 
that a new signal will appear predominantly 
in one of these partitions. 

• New physics will appear at high pT. If
new TeV-scale physics is produced in hard 
hadronic collisions, the outgoing particles 
will be energetic relative to the standard 
model and instrumental backgrounds. 

• New physics will appear as an excess of
events. Deficits manifesting the complexity
of quantum mechanics are generally difficult
to engineer without creating a corresponding
(and more obvious) excess elsewhere.

Sleuth involves three steps, following these 
three assumptions. 

• The data are partitioned into exclusive final
states. The naive "exclusive" definition
of these final states is slightly modified to
increase the likelihood that a signal will appear
predominantly within a single bin.

• Within each exclusive final state, a single
variable is considered: the summed transverse
momentum (∑pT) of all objects in
the event. Any missing energy in the event
is included in this sum if missing energy is
a significant part of the final state.

• Regions are defined in each final state by the
semi-infinite intervals with lower bound at
each data point in the ∑pT distribution.¹
The interestingness p_N of an arbitrary region
containing N data points is defined as
the Poisson probability that the integrated
background with ∑pT above the summed
transverse momentum of the lowest of the
data points would fluctuate up to or beyond N.

The most interesting region ℛ is determined by
the N data points for which p_N is minimal. The
fraction 𝒫 of hypothetical similar experiments in
which a region more interesting than ℛ would
be seen in this final state is determined by
performing pseudo experiments. The fraction 𝒫̃ of
hypothetical similar experiments in which a region
more interesting than ℛ would be seen in
any final state is determined by performing additional
pseudo experiments. The fact that many
different regions in the data have been considered
is rigorously and explicitly accounted for in going
from p_N to 𝒫̃. Sleuth and the variation
used by H1 to perform a general search of
HERA data are the only algorithms currently on
the market that perform this rigorous accounting.
The rigorous computation of this trials factor is
crucial to any prescription for conducting a
data-driven search for new physics.
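
The region search and the pseudo experiments for a single final state can be sketched as follows. This is an illustration under stated assumptions, not the Sleuth implementation: the background is represented as a hypothetical list of weighted Monte Carlo events, pseudo data are drawn from that same list, a pseudo experiment counts as "more interesting" when its smallest Poisson probability is no larger than the observed one, and the simple Poisson sampler is only suitable for modest expected counts. All function names are invented for the example.

```python
import math
import random


def poisson_tail(n, b):
    """Probability that a Poisson variable with mean b is at least n."""
    if n <= 0:
        return 1.0
    if b <= 0.0:
        return 0.0
    term = math.exp(-b + n * math.log(b) - math.lgamma(n + 1))
    total, k = 0.0, n
    while term > 0.0:
        total += term
        k += 1
        term *= b / k
        if k > b and term < total * 1e-17:
            break
    return min(total, 1.0)


def most_interesting_region(data, background):
    """data: list of summed-pT values; background: list of (sum_pt, weight).
    Returns (p_N, threshold, N) for the semi-infinite interval with minimal p_N."""
    best = (1.0, None, 0)
    for threshold in sorted(data, reverse=True):
        n = sum(1 for x in data if x >= threshold)
        b = sum(w for x, w in background if x >= threshold)
        p = poisson_tail(n, b)
        if p < best[0]:
            best = (p, threshold, n)
    return best


def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for modest means only."""
    limit, k, product = math.exp(-mean), 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return k
        k += 1


def fraction_more_interesting(data, background, n_pseudo, rng):
    """Fraction of background-only pseudo experiments whose most
    interesting region is at least as interesting as the observed one."""
    p_obs = most_interesting_region(data, background)[0]
    values = [x for x, _ in background]
    weights = [w for _, w in background]
    expected = sum(weights)
    hits = 0
    for _ in range(n_pseudo):
        n = sample_poisson(expected, rng)
        pseudo = rng.choices(values, weights=weights, k=n)
        if most_interesting_region(pseudo, background)[0] <= p_obs:
            hits += 1
    return hits / n_pseudo
```

Repeating the last step across all final states, and asking how often any final state produces a region at least as interesting as the observed one, is what accounts for the number of places a signal could have appeared.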

The inputs to Sleuth are estimated backgrounds
and observed data. The outputs are the
most interesting region ℛ observed in those data,
in the form of a specific final state and a threshold
in ∑pT; and the number 𝒫̃, a rigorous measure
of the interestingness of this region. If the data
involve no new physics, 𝒫̃ will be some random
number between zero and unity; if otherwise, we
expect 𝒫̃ to be small.

¹ This simplifies the version of Sleuth used at DØ in
Tevatron Run I, where up to four variables were used in each
final state, requiring use of Voronoi diagrams for the
definition of regions.

Five standard deviations has become the default
threshold for discovery in our field. It is
worth understanding in the context of Sleuth
why this particular threshold has been adopted.
Five standard deviations corresponds to a probability
of roughly 10^-7. A large experiment like
CDF houses over 100 graduate students, each of
whom makes on average one interesting plot per
week for roughly two years. A signal of 5 standard
deviations thus corresponds to a probability
of 10^-7 × (100 graduate students) × (50 weeks
per year) × (2 years) ≈ 10^-3, roughly 3 standard
deviations. The desire to see a 5σ effect is thus
understood as a desire to see a 3σ effect after the
number of places a signal could have appeared is
accounted for. At LEP this was referred to as
the "look-elsewhere effect"; elsewhere the phrase
"trials factor" is often used. Sleuth rigorously
computes this trials factor, so the threshold for
discovery in terms of Sleuth's 𝒫̃ corresponds to
𝒫̃ < 0.001.
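
The arithmetic in this argument can be checked in a few lines. The sketch assumes one-sided Gaussian tail probabilities and, as in the text, a rounded local probability of 10^-7; the function names are illustrative only.

```python
import math
from statistics import NormalDist


def tail_probability(n_sigma):
    """One-sided Gaussian tail probability beyond n_sigma standard deviations."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


def significance(p):
    """Number of standard deviations corresponding to a one-sided probability p."""
    return -NormalDist().inv_cdf(p)


# Trials factor from the text: students x weeks per year x years.
trials = 100 * 50 * 2
p_local = 1e-7                 # rounded probability of a 5 sigma effect
p_global = p_local * trials    # probability after the trials factor

if __name__ == "__main__":
    print("local tail probability at 5 sigma:", tail_probability(5.0))
    print("probability after trials factor:", p_global)
    print("equivalent significance:", significance(p_global))
```

The product is a good approximation to 1 − (1 − p)^trials as long as the resulting probability stays small, which is the case here.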

The claim that a random 5σ observation
equates to only 3σ after the trials factor is
accounted for can be tested. The top quark was
observed at levels of roughly five standard deviations
by the CDF and DØ experiments in Tevatron
Run I [2,3], and its existence has been confirmed
with additional data in Tevatron Run II.
Nearly everyone believes the top quark exists,
but what odds would you be willing to bet on
this? Among the several dozen colleagues who
have participated in this conversation over the
past year, the best odds obtained to date are from
a former Tevatron spokesperson, who was willing
to put up $1000 to my $1 ... corresponding to
roughly 3σ.

Two questions arise at this point. The first is 
whether Sleuth will find nothing if there is noth- 
ing there to be found. The answer is yes by con- 
struction, because of the way in which Sleuth 
computes V. The second is whether Sleuth 
would find something if there were something 
there to be found. Although impossible to answer 
in general, an answer can be given for any specific
case. Studies described in Refs. [4,5,6,7,8,9]
develop intuition for Sleuth's performance on 
different signals. 

Sleuth's evaluation of over thirty exclusive final
states at DØ in Tevatron Run I yielded no
evidence of new physics [4,5,6,7]. H1's use of a
similar algorithm [10] on data collected in HERA
Run I highlights a potentially interesting signal
in the μjν final state, with 𝒫̃ = 0.04. It will be
interesting to keep an eye on this final state in
HERA Run II.

4. Measurements and Searches 

The standard model currently contains 26 parameters.
We can take these to be the six quark
masses m_d, m_u, m_s, m_c, m_b, and m_t; the quark
mixing (CKM) matrix in the Wolfenstein parameterization
using λ, A, ρ, and η; the six lepton
masses m_e, m_μ, m_τ, m_νe, m_νμ, and m_ντ; the lepton
mixing (MNS) matrix with θ12, θ13, θ23, and
the CP-violating phase δ; the three gauge couplings
α_EM, α_W, and α_s; the two gauge masses
m_W and m_h; and the strong CP-violating parameter θ.

Tevatron Run II can contribute to the measurement
of six of these: m_t, ρ, η, α_W, m_W,
and m_h. The uncertainty on the top quark mass
m_t will drop from 5 GeV to 1-2 GeV over the
next five years. Observation of B_s mixing will reduce
the uncertainty on the CKM parameter ρ,
and to a lesser extent the uncertainty on η. The
forward-backward asymmetry of Z boson decay
is in principle sensitive to the weak mixing angle
sin θ_W, and hence the weak coupling α_W, but
will contribute little to the world average. Better
measurements of the W boson mass m_W and the
Higgs boson mass m_h will be challenging, with
large systematics to beat on the former and small
statistics to beat on the latter.

Two remarks are worth making in the spirit of 
this discussion. 

• In the context of the standard model, the
discovery of the top quark in Tevatron
Run I was less a discovery than simply a
better measurement of the top quark mass
m_t, which was already pinned down reasonably
well by precision electroweak measurements.
In a similar way, the discovery of the
Higgs boson at Tevatron Run II would be
less a discovery than simply a better measurement
of the Higgs boson mass m_h, already
known to within a factor of two from
precision electroweak measurements.

• All analyses are either a better measurement
of one of the standard model's 26 fundamental
parameters, working within the
context of the standard model, or a direct or
indirect search for new physics.

Measurements of m_t and m_h are frequently
misunderstood as searches for the top quark and
for the Higgs boson. Conversely, searches for
new physics are often misunderstood as measurements:
since the goal of measuring the top quark
production cross section is to find a discrepancy
with the standard model prediction that points
to the presence of some unknown phenomenon,
this measurement is more readily understood as
a search for new physics. Vista and Sleuth provide
methods for searching for new physics in a
model-independent and systematic way. Measuring
the top quark cross section is thus best understood
as a suboptimal way of searching for new
physics.

5. Publication of results 

High energy collider measurements are further 
obfuscated by the desire to translate them into 
quantities that can be measured "precisely." As 
an example of this, measurements of the W and Z 
boson production cross sections in pp̄ collisions at
√s ≈ 2 TeV are frequently presented in terms of
the ratio of these two cross sections by physicists 
noting that the fractional error on the resulting 
ratio is less than the error on either measurement 
individually. 

The point being missed is that the relevant 
"preciseness" is not fractional uncertainty in the 
quoted number, but rather the power of the re- 
sult to distinguish between the standard model 
and the way Nature actually behaves. Reducing 
the W and Z boson cross section measurements 
to their ratio makes sense only when publishing 
in a journal that restricts articles to ten ASCII 
characters. 

Indeed, it is hard to think of a poorer way to 
publish new scientific knowledge for the future 
testing of arbitrary new hypotheses than con- 
densing new results into a single number. We 
have nonetheless succeeded in doing so. An even 
poorer means of publishing new scientific knowl- 



edge for the future testing of arbitrary new hy- 
potheses is to show 95% confidence level exclu- 
sion contours for randomly chosen models of new 
physics. With the notable exception of exclusion
plots in m_h and in neutrino Δm² versus
tan²θ, which we really believe contain Nature at
some non-trivial point, exclusion contours are
inherently confusing and basically useless. They
are inherently confusing because it is very dif- 
ficult to determine exactly what model is being 
tested, together with all assumptions that have 
been made. They are basically useless because 
it is very difficult to tell what the data have to 
say about a model that happens to not lie on 
the two-dimensional parameter space considered. 
With the standard model extended as above to 
include massive neutrinos, Nature does not lie
on any of the beyond-the-standard-model two- 
dimensional parameter spaces that have been pro- 
duced to date. 

Clearly what we require is a means of publish- 
ing the data in their full dimensionality. 

6. Quaero

An algorithm for doing this has been achieved: 
Quaero (Latin for "I search for" or "I seek") has
been used in an initial version to make a subset
of DØ Run I data publicly available and is
under development for Tevatron Run II, LEP Run 
II, HERA Run I, and the future LHC. 

The challenge that motivates the development 
of Quaero is the high-level automation of high
energy collider analyses. Achieving such automa- 
tion would address several common problems. 

• Going from a subset of understood data and 
their backgrounds to a statement about the 
underlying theory currently has a standard 
practice, but no prescription. This begets 
the reinvention of analysis tools and redis- 
covery of statistical techniques; the tun- 
ing of neural networks and support vec- 
tor machines to specific cases continues to 
consume substantial graduate student time. 
Personal optimization strategies produce 
results correspondingly difficult to check. 

• Experimental results are frequently "uncorrected"
in order to facilitate comparison
with theory. Unfortunately the response
of high energy collider detectors, naturally
understood in terms of a Monte Carlo simulation
from partons to reconstructed objects,
is awkwardly inverted in all but the
most trivial detectors. The natural place
in which to make the comparison between
the prediction of a hypothesis and what is
observed in the data is at the level of the
reconstructed four-vectors of final state objects.

• The combination of experimental results is
hindered by the differences among procedures
used to generate those results; the use
of a common algorithm makes this combination
trivial.

Ref. [12] contains a more provocative account of
other motivating issues.

Figure 1. The Quaero web page under development
for the current generation of collider experiments.

A tool like Quaero is potentially useful because
high energy collider data are sufficiently
rich, and the array of possible new phenomena
sufficiently large, that it is not possible to test all
theoretical possibilities. A tool like Quaero
is possible because the data themselves are
relatively simple, storable as four-vectors of final
state objects.

The Quaero web page under development for 
the current generation of frontier energy experiments
is shown in Fig. 1. A querying physicist
provides the events her model predicts should 
be seen in the detector, either in the form of 
commands to an event generator like Pythia, 
or as a file with the parton level events them- 
selves. Quaero subjects these events to each 
experiment's detector simulation; partitions the 
events, standard model backgrounds, and data 
into exclusive final states, categorized according 
to the reconstructed objects in the events; selects 
a set of variables within each final state; chooses 
a binning within that variable space; computes a 
binned likelihood; combines results among differ- 
ent final states and among different experiments; 
and numerically integrates over systematic errors. 
A sampling of algorithmic detail is provided in 
Ref. [11].
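
The binned likelihood step can be illustrated with a short sketch. It assumes Poisson statistics in each bin and compares a hypothesis (standard model plus the queried signal) against the standard model alone; the choice of variables and binning, and the integration over systematic errors, are omitted. The function names and numbers are hypothetical and do not reflect Quaero's actual interface.

```python
import bisect
import math


def histogram(values, weights, edges):
    """Sum the weights of the values falling in each bin [edges[i], edges[i+1])."""
    counts = [0.0] * (len(edges) - 1)
    for x, w in zip(values, weights):
        i = bisect.bisect_right(edges, x) - 1
        if 0 <= i < len(counts):
            counts[i] += w
    return counts


def log_likelihood_ratio(observed, hypothesis, standard_model):
    """ln[L(data | hypothesis) / L(data | standard model)] for Poisson bins.
    All predicted bin contents must be positive."""
    total = 0.0
    for n, h, b in zip(observed, hypothesis, standard_model):
        total += n * math.log(h / b) - (h - b)
    return total


if __name__ == "__main__":
    edges = [0.0, 100.0, 200.0, 400.0]
    observed = histogram([50.0, 120.0, 130.0, 250.0, 310.0], [1.0] * 5, edges)
    standard_model = [1.2, 1.8, 0.6]
    hypothesis = [1.3, 2.0, 1.9]
    print(log_likelihood_ratio(observed, hypothesis, standard_model))
```

Because the bins are treated as independent, results from different final states or different experiments combine by adding their log likelihood ratios.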

7. TurboSim

The time cost of existing detector simulations 
currently being used by the major experiments 
represents one of several complications to realiz- 
ing this systematic analysis scheme. Constructing 
and tuning individual parameterized simulations 
for each experiment requires substantial human 
time; any approach that does not make use of the 
significant effort already invested in each experi- 
ment's full simulation is suboptimal. 

This line of thought has led to the construction
of an algorithm called TurboSim, a fast
simulation that tunes itself to each experiment's
full simulation. TurboSim uses fully simulated
events to generate a gigantic lookup table, matching
one or more partons with zero or more reconstructed
objects. This table, representing TurboSim's
knowledge of the full simulation, is then
used to simulate any new event that is given to
it.

Figure 2. TurboSim uses a table built from
events that have been run through the full detector
simulation to learn that this detector has
a crack in the calorimeter at η ≈ … (left), and the
non-trivial geometry of its muon system (right).
The dark (red) histogram shows the distribution
of events that have been run through the experiment's
full detector simulation; the lighter (green)
histogram shows the distribution of events run
through TurboSim.

Present computing resources are such that > 
10^ events have been generated at each of the ma- 
jor experiments, giving rise to a lookup table in 
TurboSim that is on the order of several tens of 
millions of lines long. The resulting table has suf- 
ficiently fine granularity when supplemented with 
a simple interpolation. 
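
The lookup-table idea can be sketched in a few lines. Everything specific here is an assumption made for illustration rather than a description of TurboSim: each table line matches a single parton to its reconstructed objects, the nearest stored parton is found with an ad hoc distance in ln(pT) and η, and the stored response (pT ratio and η shift) is transferred to the new parton as a crude stand-in for the interpolation mentioned in the text. The class and method names are hypothetical.

```python
import math


class LookupSimulation:
    """Toy fast simulation that replays the response of the nearest stored parton."""

    def __init__(self):
        # parton type -> list of (pt, eta, response) table lines
        self.table = {}

    def learn(self, parton, reconstructed):
        """parton: (type, pt, eta) from the generator.
        reconstructed: list of (type, pt, eta) from the full simulation,
        possibly empty if the parton was lost."""
        ptype, pt, eta = parton
        response = [(rtype, rpt / pt, reta - eta)
                    for rtype, rpt, reta in reconstructed]
        self.table.setdefault(ptype, []).append((pt, eta, response))

    def simulate(self, parton):
        """Return reconstructed objects for a new parton."""
        ptype, pt, eta = parton
        lines = self.table.get(ptype)
        if not lines:
            raise KeyError("no table lines for parton type %r" % ptype)

        def distance(line):
            return math.log(line[0] / pt) ** 2 + (line[1] - eta) ** 2

        _, _, response = min(lines, key=distance)
        return [(rtype, ratio * pt, eta + shift)
                for rtype, ratio, shift in response]


if __name__ == "__main__":
    sim = LookupSimulation()
    sim.learn(("e", 40.0, 0.0), [("e", 38.0, 0.01)])
    sim.learn(("e", 40.0, 1.1), [])   # hypothetical dead region: nothing reconstructed
    print(sim.simulate(("e", 80.0, 0.1)))
    print(sim.simulate(("e", 50.0, 1.05)))
```

Detector features such as cracks never need to be described explicitly in this scheme; they show up as table lines whose list of reconstructed objects is empty.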

The faithfulness with which TurboSim repro- 
duces the full simulation is determined by apply- 
ing each to a new sample of events, partitioning 
the results into exclusive final states, and examin- 
ing differences in normalization and in the shapes 
of all relevant kinematic distributions. Commissioning
work remains, but results so far are encouraging.
Figure 2 shows that TurboSim is
able to "learn" about a crack at |η| = … in the
calorimeter of one of the frontier energy collider
experiments, and the non-trivial geometry of the
surrounding muon chambers.

8. Bard

All of the above fall short of the desired prod- 
uct: an algorithm that takes as input the current 
theory and new experimental data, producing as 
output a new textbook describing the new under- 
lying physical theory . . . and the experiments that 
should be performed next to resolve still unan- 
swered questions. Bard is the beginnings of such 
an algorithm, designed to weave a story behind 
any hint observed in frontier energy collider data. 

Bard takes a hint observed by Vista or
Sleuth; uses MadGraph to generate all con- 
ceivable new perturbative Feynman diagrams rep- 
resenting possible signals explaining that ob- 
served hint, introducing new particles and pa- 
rameters as necessary; uses Quaero to fit for 
the best values of these introduced parameters 
for each diagram; and uses Quaero to rank each 
new diagram's success in providing an improved 
description of the data. Bard's output is thus 
an ordered list of possible diagrammatic explana- 
tions, new particles and best fit parameters (cou- 
plings and masses), together with a measure of 
how much better that signal explains the data 
than the standard model alone. 



9. Summary 

The clarity of the standard model and ambiguity
in its extension suggest a potentially fruitful
modification to the current approach of analyzing
high energy collider data. These proceedings have
described several algorithms in this spirit. Vista
enables an extensive mental view of the data in
their entirety, consistently understood in terms of
the standard model prediction and systematically
assigned fudge factors. The goal of Vista is to
fail to obtain such a consistent global understanding,
suggesting the presence of new large cross
section physics. If a consistent understanding of
the gross features of the data is achieved with
Vista, new low cross section physics expected
at or above the electroweak scale is searched for
in a model-independent way using Sleuth, being
careful with the statistics of small signals.
The publication and testing of specific hypotheses
against data globally understood through Vista
and Sleuth is facilitated by Quaero, an algorithm
that automates high energy collider analyses,
allowing as a side effect a qualitatively new
medium for publishing high energy collider data.
The practical implementation of Vista, Sleuth,
and Quaero is facilitated by TurboSim, which
tunes itself to an existing full detector simulation
by constructing a large lookup table, reducing
the time cost for simulating events by roughly
three orders of magnitude. Interpreting a hint
seen by Vista or Sleuth in terms of the underlying
physical theory is the goal of Bard, which
systematically considers possible perturbative
explanations and uses Quaero to check their
explanatory power.

The application of these ideas to frontier en- 
ergy collider data is an ongoing effort. It will be 
interesting to see what we see. 